Abstract

We give improved constants for data dependent and 
variance sensitive confidence bounds, called em- 
pirical Bernstein bounds, and extend these inequal- 
ities to hold uniformly over classes of functions 
whose growth function is polynomial in the sam- 
ple size n. The bounds lead us to consider sam- 
ple variance penalization, a novel learning method 
which takes into account the empirical variance of 
the loss function. We give conditions under which 
sample variance penalization is effective. In par- 
ticular, we present a bound on the excess risk in- 
curred by the method. Using this, we argue that 
there are situations in which the excess risk of our 
method is of order $1/n$, while the excess risk of
empirical risk minimization is of order $1/\sqrt{n}$. We
show some experimental results, which confirm the 
theory. Finally, we discuss the potential applica- 
tion of our results to sample compression schemes. 

1 Introduction 

The method of empirical risk minimization (ERM) is so intuitive
that some of the less plausible alternatives have received
little attention from the machine learning community. In
this work we present sample variance penalization (SVP), a 
method which is motivated by some variance-sensitive, data- 
dependent confidence bounds, which we develop in the pa- 
per. We describe circumstances under which SVP works bet- 
ter than ERM and provide some preliminary experimental 
results which confirm the theory. 

In order to explain the underlying ideas and highlight the 
differences between SVP and ERM, we begin with a discus- 
sion of the confidence bounds most frequently used in learn- 
ing theory. 

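Theorem 1 (Hoeffding's inequality) Let $Z, Z_1, \dots, Z_n$ be
i.i.d. random variables with values in $[0, 1]$ and let $\delta > 0$.
Then with probability at least $1 - \delta$ in $(Z_1, \dots, Z_n)$ we have

$$\mathbb{E}Z - \frac{1}{n}\sum_{i=1}^{n} Z_i \le \sqrt{\frac{\ln 1/\delta}{2n}}.$$
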
It is customary to call this result Hoeffding's inequality.
It appears in a stronger, more general form in Hoeffding's
1963 milestone paper [4]. Proofs can be found in [4] or
@]. We cited Hoeffding's inequality in the form of a confidence-
dependent bound on the deviation, which is more convenient
for our discussion than a deviation-dependent bound on the
confidence. Replacing $Z$ by $1 - Z$ shows that the confidence
interval is symmetric about $\mathbb{E}Z$.

Suppose some underlying observation is modeled by a
random variable $X$, distributed in some space $\mathcal{X}$ according
to some law $\mu$. In learning theory Hoeffding's inequality is
often applied when $Z$ measures the loss incurred by some
hypothesis $h$ when $X$ is observed, that is,

$$Z = \ell_h(X).$$

The expectation $\mathbb{E}_{X \sim \mu}\, \ell_h(X)$ is called the risk associated
with hypothesis $h$ and distribution $\mu$. Since the risk depends
only on the function $\ell_h$ and on $\mu$ we can write the risk as

$$P(\ell_h, \mu),$$

where $P$ is the expectation functional. If an i.i.d. vector
$\mathbf{X} = (X_1, \dots, X_n)$ has been observed, then Hoeffding's
inequality allows us to estimate the risk, for fixed hypothesis,
by the empirical risk

$$P_n(\ell_h, \mathbf{X}) = \frac{1}{n}\sum_{i} \ell_h(X_i)$$

within a confidence interval of length $2\sqrt{\ln(1/\delta)/(2n)}$.
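As a minimal sketch of how the Hoeffding deviation term is used numerically, the following Python fragment computes the radius $\sqrt{\ln(1/\delta)/(2n)}$ and the resulting interval around an empirical risk. The function names are illustrative and not part of the paper; losses are assumed to lie in $[0, 1]$.

```python
import math


def hoeffding_radius(n, delta):
    """Deviation term sqrt(ln(1/delta) / (2n)) from Theorem 1."""
    if n < 1 or not 0.0 < delta < 1.0:
        raise ValueError("need n >= 1 and 0 < delta < 1")
    return math.sqrt(math.log(1.0 / delta) / (2.0 * n))


def hoeffding_interval(losses, delta):
    """Interval around the empirical risk, clipped to [0, 1].

    Each one-sided bound holds with probability at least 1 - delta.
    """
    n = len(losses)
    mean = sum(losses) / n
    r = hoeffding_radius(n, delta)
    return max(0.0, mean - r), min(1.0, mean + r)
```

The radius depends only on $n$ and $\delta$, not on the observed losses, which is the hypothesis-independence discussed in the text.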

Let us call the set $\mathcal{F}$ of functions $\ell_h$ for all different
hypotheses $h$ the hypothesis space and its members hypotheses,
ignoring the distinction between a hypothesis $h$ and the
induced loss function $\ell_h$. The bound in Hoeffding's inequality
can easily be adjusted to hold uniformly over any finite
hypothesis space $\mathcal{F}$ to give the following well known result
[1].

Corollary 2 Let $X$ be a random variable with values in a
set $\mathcal{X}$ with distribution $\mu$, and let $\mathcal{F}$ be a finite class of
hypotheses $f : \mathcal{X} \to [0, 1]$ and $\delta > 0$. Then with probability at
least $1 - \delta$ in $\mathbf{X} = (X_1, \dots, X_n) \sim \mu^n$

$$P(f, \mu) - P_n(f, \mathbf{X}) \le \sqrt{\frac{\ln(|\mathcal{F}|/\delta)}{2n}}, \quad \forall f \in \mathcal{F},$$

where $|\mathcal{F}|$ is the cardinality of $\mathcal{F}$.



This result can be further extended to hold uniformly
over hypothesis spaces whose complexity can be controlled
with different covering numbers which then appear in place
of the cardinality $|\mathcal{F}|$ above. A large body of literature exists
on the subject of such uniform bounds to justify hypothesis
selection by empirical risk minimization, see [1] and references
therein. Given a sample $\mathbf{X}$ and a hypothesis space $\mathcal{F}$,
empirical risk minimization selects the hypothesis

$$\mathrm{ERM}(\mathbf{X}) = \arg\min_{f \in \mathcal{F}} P_n(f, \mathbf{X}).$$



A drawback of Hoeffding's inequality is that the confidence
interval is independent of the hypothesis in question,
and always of order $\sqrt{1/n}$, leaving us with a uniformly
blurred view of the hypothesis class. But for hypotheses of
small variance better estimates are possible, such as the
following, which can be derived from what is usually called
Bennett's inequality (see e.g. Hoeffding's paper [4]).

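Theorem 3 (Bennett's inequality) Under the conditions of
Theorem 1 we have with probability at least $1 - \delta$ that

$$\mathbb{E}Z - \frac{1}{n}\sum_{i=1}^{n} Z_i \le \sqrt{\frac{2\, \mathbb{V}Z \ln 1/\delta}{n}} + \frac{\ln 1/\delta}{3n},$$

where $\mathbb{V}Z$ is the variance $\mathbb{V}Z = \mathbb{E}(Z - \mathbb{E}Z)^2$.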

The bound is symmetric about $\mathbb{E}Z$ and for large $n$ the
confidence interval is now close to $2\sqrt{\mathbb{V}Z}$ times the
confidence interval in Hoeffding's inequality. A version of this
bound which is uniform over finite hypothesis spaces, analogous
to Corollary 2, is easily obtained, involving now for
each hypothesis $h$ the variance $\mathbb{V}h(X)$. If $h_1$ and $h_2$ are two
hypotheses then $2\sqrt{\mathbb{V}h_1(X)}$ and $2\sqrt{\mathbb{V}h_2(X)}$ are always
less than or equal to 1 but they can also be much smaller, or
one of them can be substantially smaller than the other one.
For hypotheses of zero variance the diameter of the confidence
interval decays as $O(1/n)$.
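The comparison between the two deviation terms can be made concrete with a small sketch. The helper names below are illustrative assumptions; the formulas are those of Theorem 1 and Theorem 3, and the variance is passed in as a known number, which is exactly what is not available in practice.

```python
import math


def hoeffding_radius(n, delta):
    """Deviation term of Theorem 1."""
    return math.sqrt(math.log(1.0 / delta) / (2.0 * n))


def bennett_radius(variance, n, delta):
    """Deviation term of Theorem 3 for a [0, 1]-valued variable with the given variance."""
    log_term = math.log(1.0 / delta)
    return math.sqrt(2.0 * variance * log_term / n) + log_term / (3.0 * n)


def radius_ratio(variance, n, delta):
    """Ratio Bennett / Hoeffding; tends to 2 * sqrt(variance) as n grows."""
    return bennett_radius(variance, n, delta) / hoeffding_radius(n, delta)
```

For the maximal variance $1/4$ the square-root terms coincide and Bennett's term is slightly larger because of the extra $\ln(1/\delta)/(3n)$; for zero variance only that $O(1/n)$ term remains.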

Bennett's inequality therefore provides us with estimates 
of lower accuracy for hypotheses of large variance, and higher 
accuracy for hypotheses of small variance. Given many hy- 
potheses of equal and nearly minimal empirical risk it seems 
intuitively safer to select the one whose true risk can be most 
accurately estimated (a point to which we shall return). But 
unfortunately the right hand side of Bennett's inequality de- 
pends on the unobservable variance, so our view of the hy- 
pothesis class remains uniformly blurred. 

1.1 Main results and SVP algorithm 

We are now ready to describe the main results of the paper, 
which provide the motivation for the SVP algorithm. 

Our first result provides a purely data-dependent bound 
with similar properties as Bennett's inequality. 

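Theorem 4 Under the conditions of Theorem 1 we have with
probability at least $1 - \delta$ in the i.i.d. vector $\mathbf{Z} = (Z_1, \dots, Z_n)$
that

$$\mathbb{E}Z - \frac{1}{n}\sum_{i=1}^{n} Z_i \le \sqrt{\frac{2 V_n(\mathbf{Z}) \ln 2/\delta}{n}} + \frac{7 \ln 2/\delta}{3(n-1)},$$

where $V_n(\mathbf{Z})$ is the sample variance

$$V_n(\mathbf{Z}) = \frac{1}{n(n-1)} \sum_{1 \le i < j \le n} (Z_i - Z_j)^2.$$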
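Because every quantity on the right hand side of Theorem 4 is observable, the bound can be evaluated directly from a sample. The following is a minimal sketch under the theorem's assumptions (values in $[0, 1]$, $n \ge 2$); the function names are illustrative.

```python
import math
from itertools import combinations


def sample_variance(z):
    """V_n(z) = (1 / (n (n - 1))) * sum over i < j of (z_i - z_j)^2."""
    n = len(z)
    if n < 2:
        raise ValueError("need at least two observations")
    return sum((a - b) ** 2 for a, b in combinations(z, 2)) / (n * (n - 1))


def empirical_bernstein_radius(z, delta):
    """Right hand side of Theorem 4, computed from the sample alone."""
    n = len(z)
    log_term = math.log(2.0 / delta)
    return (math.sqrt(2.0 * sample_variance(z) * log_term / n)
            + 7.0 * log_term / (3.0 * (n - 1)))
```

For a constant sample the square-root term vanishes and only the $O(1/n)$ term remains, mirroring the zero-variance case of Bennett's inequality.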

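We next extend Theorem 4 over a finite function class.

Corollary 5 Let $X$ be a random variable with values in a
set $\mathcal{X}$ with distribution $\mu$, and let $\mathcal{F}$ be a finite class of
hypotheses $f : \mathcal{X} \to [0, 1]$. For $\delta > 0$, $n \ge 2$ we have with
probability at least $1 - \delta$ in $\mathbf{X} = (X_1, \dots, X_n) \sim \mu^n$ that

$$P(f, \mu) - P_n(f, \mathbf{X}) \le \sqrt{\frac{2 V_n(f, \mathbf{X}) \ln(2|\mathcal{F}|/\delta)}{n}} + \frac{7 \ln(2|\mathcal{F}|/\delta)}{3(n-1)}, \quad \forall f \in \mathcal{F},$$

where $V_n(f, \mathbf{X}) = V_n(f(X_1), \dots, f(X_n))$.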



Theorem 4 makes the diameter of the confidence interval
observable. The corollary is obtained from a union bound
over $\mathcal{F}$, analogous to Corollary 2, and provides us with a
view of the loss class which is blurred for hypotheses of large
sample variance, and more in focus for hypotheses of small
sample variance.

We note that an analogous result to Theorem 4 is given
by Audibert et al. [2]. Our technique of proof is new and the
bound we derive has a slightly better constant. Theorem 4
itself resembles Bernstein's or Bennett's inequality, in confidence
bound form, but in terms of observable quantities. For
this reason it has been called an empirical Bernstein bound
in @. In [2] Audibert et al. apply their result to the analysis
of algorithms for the multi-armed bandit problem and in @
it is used to derive stopping rules for sampling procedures.
We will prove Theorem 4 in Section 2 together with some
useful confidence bounds on the standard deviation, which
may be valuable in their own right.

Our next result extends the uniform estimate in Corollary
5 to infinite loss classes whose complexity can be suitably
controlled. Beyond the simple extension involving covering
numbers for $\mathcal{F}$ in the uniform norm $\|\cdot\|_\infty$, we can use
the following complexity measure, which is also fairly
commonplace in the machine learning literature fl|], yfl.

For $\epsilon > 0$, a function class $\mathcal{F}$ and an integer $n$, the
"growth function" $\mathcal{N}_\infty(\epsilon, \mathcal{F}, n)$ is defined as

$$\mathcal{N}_\infty(\epsilon, \mathcal{F}, n) = \sup_{\mathbf{x} \in \mathcal{X}^n} \mathcal{N}(\epsilon, \mathcal{F}(\mathbf{x}), \|\cdot\|_\infty),$$

where $\mathcal{F}(\mathbf{x}) = \{(f(x_1), \dots, f(x_n)) : f \in \mathcal{F}\} \subseteq \mathbb{R}^n$ and
for $A \subseteq \mathbb{R}^n$ the number $\mathcal{N}(\epsilon, A, \|\cdot\|_\infty)$ is the smallest
cardinality $|A_0|$ of a set $A_0 \subseteq A$ such that $A$ is contained in
the union of $\epsilon$-balls centered at points in $A_0$, in the metric
induced by $\|\cdot\|_\infty$.
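To make the covering number in this definition tangible, here is a small sketch that builds an $\epsilon$-cover of a finite set $A \subseteq \mathbb{R}^n$ in the sup norm with centers taken from $A$. It uses a greedy pass, which is an assumption of this illustration and not a construction from the paper: the size of the returned cover is only an upper bound on $\mathcal{N}(\epsilon, A, \|\cdot\|_\infty)$, not necessarily the minimum, and balls are taken to be closed.

```python
def sup_distance(u, v):
    """Distance between two vectors in the metric induced by the sup norm."""
    return max(abs(a - b) for a, b in zip(u, v))


def greedy_cover(points, eps):
    """Return centers from `points` such that every point lies within eps of a center.

    A point becomes a new center only if no existing center covers it, so the
    result is a valid eps-cover; its size upper-bounds the covering number.
    """
    centers = []
    for p in points:
        if all(sup_distance(p, c) > eps for c in centers):
            centers.append(p)
    return centers
```

Applied to the vectors $(f(x_1), \dots, f(x_n))$ of a finite family of functions on a fixed sample, `len(greedy_cover(...))` gives a computable upper estimate for that one sample; the growth function additionally takes a supremum over all samples.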

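Theorem 6 Let $X$ be a random variable with values in a
set $\mathcal{X}$ with distribution $\mu$ and let $\mathcal{F}$ be a class of hypotheses
$f : \mathcal{X} \to [0, 1]$. Fix $\delta \in (0, 1)$, $n \ge 16$ and set

$$\mathcal{M}(n) = 10\, \mathcal{N}_\infty(1/n, \mathcal{F}, 2n).$$

Then with probability at least $1 - \delta$ in the random vector
$\mathbf{X} = (X_1, \dots, X_n) \sim \mu^n$ we have

$$P(f, \mu) - P_n(f, \mathbf{X}) \le \sqrt{\frac{18 V_n(f, \mathbf{X}) \ln(\mathcal{M}(n)/\delta)}{n}} + \frac{15 \ln(\mathcal{M}(n)/\delta)}{n-1}, \quad \forall f \in \mathcal{F}.$$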



The structure of this bound is very similar to Corollary 5,
with $2|\mathcal{F}|$ replaced by $\mathcal{M}(n)$. In a number of practical cases
polynomial growth of $\mathcal{N}_\infty(1/n, \mathcal{F}, n)$ in $n$ has been established.
For instance, we quote [3, equation (28)] which states
that for the bounded linear functionals in the reproducing
kernel Hilbert space associated with Gaussian kernels one
has $\ln \mathcal{N}_\infty(1/n, \mathcal{F}, 2n) = O(\ln^{3/2} n)$. Composition with
fixed Lipschitz functions preserves this property, so we can
see that Theorem 6 is applicable to a large family of function
classes which occur in machine learning. We will prove
Theorem 6 in Section 3.

Since the minimization of uniform upper bounds is frequent
practice in machine learning, one could consider minimizing
the bounds in Corollary 5 or Theorem 6. This leads
to sample variance penalization, a technique which selects
the hypothesis

$$\mathrm{SVP}_\lambda(\mathbf{X}) = \arg\min_{f \in \mathcal{F}} P_n(f, \mathbf{X}) + \lambda \sqrt{\frac{V_n(f, \mathbf{X})}{n}},$$

where $\lambda \ge 0$ is some regularization parameter. For $\lambda = 0$
we recover empirical risk minimization. The last term on the
right hand side can be regarded as a data-dependent regularizer.
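For a finite hypothesis class the selection rule can be written in a few lines. The sketch below assumes the losses of each hypothesis on the sample are available as a table (`loss_table[k][i]` is the loss of hypothesis `k` on the `i`-th observation); this table layout and the function names are assumptions made for illustration.

```python
import math
from itertools import combinations


def empirical_risk(losses):
    """P_n: the average loss on the sample."""
    return sum(losses) / len(losses)


def sample_variance(losses):
    """V_n: pairwise form of the sample variance."""
    n = len(losses)
    return sum((a - b) ** 2 for a, b in combinations(losses, 2)) / (n * (n - 1))


def svp_select(loss_table, lam):
    """Index of the hypothesis minimizing P_n + lam * sqrt(V_n / n).

    With lam = 0 this reduces to empirical risk minimization.
    """
    n = len(loss_table[0])

    def objective(losses):
        return empirical_risk(losses) + lam * math.sqrt(sample_variance(losses) / n)

    return min(range(len(loss_table)), key=lambda k: objective(loss_table[k]))
```

For example, with a constant hypothesis of loss 0.5 and a noisy hypothesis with losses `[0.0, 1.0, 0.0, 0.9]`, the choice `lam = 0` picks the noisy one (empirical risk 0.475), while `lam = 2.5` picks the constant one because the variance penalty of the noisy hypothesis is 2.5 times 0.275.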

Why, and under which circumstances, should sample variance
penalization work better than empirical risk minimization?
If two hypotheses have the same empirical risk, why
should we discard the one with higher sample variance? After
all, the empirical risk of the high variance hypothesis may
be just as much overestimating the true risk as underestimating
it. In Section 4 we will argue that the decay of the excess
risk of sample variance penalization can be bounded in terms
of the variance of an optimal hypothesis (see Theorem 15)
and if there is an optimal hypothesis with zero variance, then
the excess risk decreases as $1/n$. We also give an example of
such a case where the excess risk of empirical risk minimization
cannot decrease faster than $O(1/\sqrt{n})$. We then report
on the comparison of the two algorithms in a toy experiment.

Finally, in Section 5 we present some preliminary observations
concerning the application of empirical Bernstein
bounds to sample-compression schemes.

1.2 Notation 

We summarize the notation used throughout the paper. We
define the following functions on the cube $[0, 1]^n$, which will
be used throughout. For every $\mathbf{x} = (x_1, \dots, x_n) \in [0, 1]^n$
we let

$$P_n(\mathbf{x}) = \frac{1}{n}\sum_{i=1}^{n} x_i$$

and

$$V_n(\mathbf{x}) = \frac{1}{2n(n-1)}\sum_{i,j=1}^{n} (x_i - x_j)^2.$$

If $\mathcal{X}$ is some set, $f : \mathcal{X} \to [0, 1]$ and $\mathbf{x} = (x_1, \dots, x_n) \in \mathcal{X}^n$
we write $f(\mathbf{x}) = (f(x_1), \dots, f(x_n))$, $P_n(f, \mathbf{x}) = P_n(f(\mathbf{x}))$
and $V_n(f, \mathbf{x}) = V_n(f(\mathbf{x}))$.

Questions of measurability will be ignored throughout;
if necessary this is enforced through finiteness assumptions.
If $X$ is a real valued random variable we use $\mathbb{E}X$ and $\mathbb{V}X$
to denote its expectation and variance, respectively. If $X$
is a random variable distributed in some set $\mathcal{X}$ according to
a distribution $\mu$, we write $X \sim \mu$. Product measures are
denoted by the symbols $\times$ or $\prod$, $\mu^n$ is the $n$-fold product
of $\mu$ and the random variable $\mathbf{X} = (X_1, \dots, X_n) \sim \mu^n$
is an i.i.d. sample generated from $\mu$. If $X \sim \mu$ and $f : \mathcal{X} \to \mathbb{R}$
then we write $P(f, \mu) = \mathbb{E}_{X \sim \mu} f(X) = \mathbb{E}f(X)$
and $V(f, \mu) = \mathbb{V}_{X \sim \mu} f(X) = \mathbb{V}f(X)$.
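The two sample functionals translate directly into code. The sketch below uses the double-sum form of $V_n$ given above and compares it with the usual unbiased sample variance from Python's `statistics` module; the names `P_n` and `V_n` are chosen to match the notation.

```python
import statistics


def P_n(x):
    """Sample mean (1/n) * sum of x_i."""
    return sum(x) / len(x)


def V_n(x):
    """Sample variance (1 / (2 n (n - 1))) * sum over all i, j of (x_i - x_j)^2."""
    n = len(x)
    return sum((xi - xj) ** 2 for xi in x for xj in x) / (2 * n * (n - 1))


def agrees_with_unbiased_variance(x, tol=1e-12):
    """Check that the pairwise form equals the (n - 1)-normalized variance."""
    return abs(V_n(x) - statistics.variance(x)) < tol
```

The agreement follows from $\sum_{i,j}(x_i - x_j)^2 = 2n \sum_i (x_i - P_n(\mathbf{x}))^2$, so $V_n$ is the familiar unbiased estimator written without reference to the mean.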

2 Empirical Bernstein bounds and variance 
estimation 

In this section, we prove Theorem 4 and some related useful
results, in particular concentration inequalities for the variance
of a bounded random variable, (5) and (6) below, which
may be of independent interest. For future use we derive our
results for the more general case where the $X_i$ in the sample
are independent, but not necessarily identically distributed.

We need two auxiliary results. One is a concentration
inequality for self-bounding random variables (Theorem 13
inO]):

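Theorem 7 Let $\mathbf{X} = (X_1, \dots, X_n)$ be a vector of independent
random variables with values in some set $\mathcal{X}$. For
$1 \le k \le n$ and $y \in \mathcal{X}$, we use $\mathbf{X}_{y,k}$ to denote the vector
obtained from $\mathbf{X}$ by replacing $X_k$ by $y$. Suppose that $a \ge 1$
and that $Z = Z(\mathbf{X})$ satisfies the inequalities

$$Z(\mathbf{X}) - \inf_{y \in \mathcal{X}} Z(\mathbf{X}_{y,k}) \le 1, \quad \forall k, \qquad (1)$$

$$\sum_{k=1}^{n} \Big( Z(\mathbf{X}) - \inf_{y \in \mathcal{X}} Z(\mathbf{X}_{y,k}) \Big)^2 \le a Z(\mathbf{X}) \qquad (2)$$

almost surely. Then, for $t > 0$,

$$\Pr\{\mathbb{E}Z - Z > t\} \le \exp\left( \frac{-t^2}{2a\, \mathbb{E}Z} \right).$$

If $Z$ satisfies only the self-boundedness condition (2) we still
have

$$\Pr\{Z - \mathbb{E}Z > t\} \le \exp\left( \frac{-t^2}{2a\, \mathbb{E}Z + at} \right).$$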



The other result we need is a technical lemma on condi- 
tional expectations. 

Lemma 8 Let $X$, $Y$ be i.i.d. random variables with values
in an interval $[a, a + 1]$. Then

$$\mathbb{E}_X \big[ \mathbb{E}_Y (X - Y)^2 \big]^2 \le (1/2)\, \mathbb{E}(X - Y)^2.$$

Proof: The right side of the above inequality is of course the
variance $\mathbb{E}[X^2 - XY]$. One computes

$$\mathbb{E}_X \big[ \mathbb{E}_Y (X - Y)^2 \big]^2 = \mathbb{E}\big[ X^4 + 3X^2Y^2 - 4X^3Y \big].$$

We therefore have to show that $\mathbb{E}[g(X, Y)] \ge 0$ where

$$g(X, Y) = X^2 - XY - X^4 - 3X^2Y^2 + 4X^3Y.$$

A rather tedious computation gives

$$\begin{aligned}
g(X, Y) + g(Y, X) &= X^2 - XY - X^4 - 3X^2Y^2 + 4X^3Y \\
&\quad + Y^2 - XY - Y^4 - 3X^2Y^2 + 4Y^3X \\
&= (X - Y + 1)(Y - X + 1)(Y - X)^2.
\end{aligned}$$

The latter expression is clearly nonnegative, so

$$2\, \mathbb{E}[g(X, Y)] = \mathbb{E}[g(X, Y) + g(Y, X)] \ge 0,$$

which completes the proof. ■
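The "rather tedious computation" in the proof can be spot-checked numerically. The following sketch evaluates both sides of the symmetrized identity at random points of $[0, 1]^2$; it is a sanity check under floating-point arithmetic, not a substitute for the algebra, and the helper names are illustrative.

```python
import random


def g(x, y):
    """The polynomial g from the proof of Lemma 8."""
    return x**2 - x*y - x**4 - 3*x**2*y**2 + 4*x**3*y


def factored(x, y):
    """Claimed closed form of g(x, y) + g(y, x)."""
    return (x - y + 1) * (y - x + 1) * (y - x) ** 2


def max_identity_gap(trials=1000, seed=0):
    """Largest observed gap between g(x, y) + g(y, x) and its factored form."""
    rng = random.Random(seed)
    gap = 0.0
    for _ in range(trials):
        x, y = rng.random(), rng.random()
        gap = max(gap, abs(g(x, y) + g(y, x) - factored(x, y)))
    return gap
```

Since $|x - y| \le 1$ on the unit interval, both linear factors are nonnegative, which is where the assumption that the values lie in an interval of length one enters.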

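When the random variables $X$ and $Y$ are uniformly distributed
on a finite set $\{x_1, \dots, x_n\}$, Lemma 8 gives the following
useful corollary.

Corollary 9 Suppose $\{x_1, \dots, x_n\} \subset [0, 1]$. Then

$$\frac{1}{n}\sum_{k} \Big( \frac{1}{n}\sum_{j} (x_k - x_j)^2 \Big)^2 \le \frac{1}{2n^2} \sum_{k,j} (x_k - x_j)^2.$$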

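We first establish confidence bounds for the standard deviation.

Theorem 10 Let $n \ge 2$ and $\mathbf{X} = (X_1, \dots, X_n)$ be a vector
of independent random variables with values in $[0, 1]$. Then
for $\delta > 0$ we have, writing $\mathbb{E}V_n$ for $\mathbb{E}_{\mathbf{X}} V_n(\mathbf{X})$,

$$\Pr\left\{ \sqrt{\mathbb{E}V_n} > \sqrt{V_n(\mathbf{X})} + \sqrt{\frac{2 \ln 1/\delta}{n-1}} \right\} \le \delta \qquad (3)$$

$$\Pr\left\{ \sqrt{V_n(\mathbf{X})} > \sqrt{\mathbb{E}V_n} + \sqrt{\frac{2 \ln 1/\delta}{n-1}} \right\} \le \delta. \qquad (4)$$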



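Proof: Write $Z(\mathbf{X}) = n V_n(\mathbf{X})$. Now fix some $k$ and
choose any $y \in [0, 1]$. Then

$$\begin{aligned}
Z(\mathbf{X}) - Z(\mathbf{X}_{y,k}) &= \frac{1}{n-1} \sum_{j} \big( (X_k - X_j)^2 - (y - X_j)^2 \big) \\
&\le \frac{1}{n-1} \sum_{j} (X_k - X_j)^2.
\end{aligned}$$

It follows that $Z(\mathbf{X}) - \inf_{y \in [0, 1]} Z(\mathbf{X}_{y,k}) \le 1$. We also get

$$\begin{aligned}
\sum_{k} \Big( Z(\mathbf{X}) - \inf_{y \in [0, 1]} Z(\mathbf{X}_{y,k}) \Big)^2
&\le \sum_{k} \Big( \frac{1}{n-1} \sum_{j} (X_k - X_j)^2 \Big)^2 \\
&\le \frac{n^3}{(n-1)^2} \cdot \frac{1}{2n^2} \sum_{k,j} (X_k - X_j)^2 \\
&= \frac{n}{n-1} Z(\mathbf{X}),
\end{aligned}$$

where we applied Corollary 9 to get the second inequality.
It follows that $Z$ satisfies (1) and (2) with $a = n/(n-1)$.
From Theorem 7 and

$$\Pr\{\pm \mathbb{E}V_n \mp V_n(\mathbf{X}) > s\} = \Pr\{\pm \mathbb{E}Z \mp Z(\mathbf{X}) > ns\}$$

we can therefore conclude the following concentration result
for the sample variance: For $s > 0$

$$\Pr\{\mathbb{E}V_n - V_n(\mathbf{X}) > s\} \le \exp\left( \frac{-(n-1) s^2}{2\, \mathbb{E}V_n} \right) \qquad (5)$$

$$\Pr\{V_n(\mathbf{X}) - \mathbb{E}V_n > s\} \le \exp\left( \frac{-(n-1) s^2}{2\, \mathbb{E}V_n + s} \right). \qquad (6)$$

From the lower tail bound (5) we obtain with probability at
least $1 - \delta$ that

$$\mathbb{E}V_n - 2\sqrt{\mathbb{E}V_n} \sqrt{\frac{\ln 1/\delta}{2(n-1)}} \le V_n(\mathbf{X}).$$

Completing the square on the left hand side, taking the square-root,
adding $\sqrt{\ln(1/\delta)/(2(n-1))}$ and using $\sqrt{a + b} \le
\sqrt{a} + \sqrt{b}$ gives (3). Solving the right side of (6) for $s$ and
using the same square-root inequality we find that with probability
at least $1 - \delta$ we have

$$\begin{aligned}
V_n(\mathbf{X}) &\le \mathbb{E}V_n + 2\sqrt{\frac{\mathbb{E}V_n \ln 1/\delta}{2(n-1)}} + \frac{\ln 1/\delta}{n-1} \\
&= \left( \sqrt{\mathbb{E}V_n} + \sqrt{\frac{\ln 1/\delta}{2(n-1)}} \right)^2 + \frac{\ln 1/\delta}{2(n-1)}.
\end{aligned}$$

Taking the square-root and using the root-inequality again
gives (4). ■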

We can now prove the empirical Bernstein bound, which
reduces to Theorem 4 for identically distributed variables.

Theorem 11 Let $\mathbf{X} = (X_1, \dots, X_n)$ be a vector of independent
random variables with values in $[0, 1]$. Let $\delta > 0$.
Then with probability at least $1 - \delta$ in $\mathbf{X}$ we have

$$\mathbb{E}[P_n(\mathbf{X})] \le P_n(\mathbf{X}) + \sqrt{\frac{2 V_n(\mathbf{X}) \ln 2/\delta}{n}} + \frac{7 \ln 2/\delta}{3(n-1)}.$$

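Proof: Write $W = (1/n) \sum_{i} \mathbb{V}X_i$ and observe that

$$\begin{aligned}
W &\le \frac{1}{n} \sum_{i} \mathbb{E}(X_i - \mathbb{E}X_i)^2 &(7) \\
&\quad + \frac{1}{2n(n-1)} \sum_{i,j} (\mathbb{E}X_i - \mathbb{E}X_j)^2 &(8) \\
&= \frac{1}{2n(n-1)} \sum_{i,j} \mathbb{E}(X_i - X_j)^2 \\
&= \mathbb{E}V_n. &(9)
\end{aligned}$$

Recall that Bennett's inequality, which holds also if the $X_i$
are not identically distributed (see [8]), implies with probability
at least $1 - \delta$

$$\begin{aligned}
\mathbb{E}P_n(\mathbf{X}) &\le P_n(\mathbf{X}) + \sqrt{\frac{2 W \ln 1/\delta}{n}} + \frac{\ln 1/\delta}{3n} \\
&\le P_n(\mathbf{X}) + \sqrt{\frac{2\, \mathbb{E}V_n \ln 1/\delta}{n}} + \frac{\ln 1/\delta}{3n},
\end{aligned}$$

so that the conclusion follows from combining this inequality
with (3) in a union bound and some simple estimates. ■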
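The "simple estimates" in the last step can be spelled out as follows. This is one way to carry out the arithmetic, assuming Bennett's inequality and (3) are each applied with confidence parameter $\delta/2$ and writing $L = \ln(2/\delta)$; the abbreviation $L$ is introduced here for convenience.

```latex
% L = \ln(2/\delta); Bennett's inequality and (3) are each used with \delta/2.
\begin{align*}
\sqrt{\frac{2\,\mathbb{E}V_n\,L}{n}}
  &\le \sqrt{\frac{2L}{n}}\left(\sqrt{V_n(\mathbf{X})} + \sqrt{\frac{2L}{n-1}}\right)
   = \sqrt{\frac{2V_n(\mathbf{X})\,L}{n}} + \frac{2L}{\sqrt{n(n-1)}} \\
  &\le \sqrt{\frac{2V_n(\mathbf{X})\,L}{n}} + \frac{2L}{n-1}, \\
\frac{L}{3n} &\le \frac{L}{3(n-1)},
\qquad
\frac{2L}{n-1} + \frac{L}{3(n-1)} = \frac{7L}{3(n-1)}.
\end{align*}
```

Adding the two contributions reproduces the constant $7/3$ in the statement of Theorem 11.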



3 Empirical Bernstein bounds for function 
classes of polynomial growth 

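We now prove Theorem 6. We will use the classical double-sample
method (O, 11DX but we have to pervert it somewhat
to adapt it to the nonlinearity of the empirical standard-deviation
functional. Define functions $\Phi, \Psi : [0, 1]^n \times \mathbb{R}_+ \to \mathbb{R}$ by

$$\Phi(\mathbf{x}, t) = P_n(\mathbf{x}) + \sqrt{\frac{2 V_n(\mathbf{x})\, t}{n}} + \frac{7t}{3(n-1)},$$

$$\Psi(\mathbf{x}, t) = P_n(\mathbf{x}) + \sqrt{\frac{18 V_n(\mathbf{x})\, t}{n}} + \frac{11t}{n-1}.$$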

We first record some simple Lipschitz properties of these 
functions. 

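Lemma 12 For $t > 0$, $\mathbf{x}, \mathbf{x}' \in [0, 1]^n$ we have
(i) $\Phi(\mathbf{x}, t) - \Phi(\mathbf{x}', t) \le (1 + 2\sqrt{t/n})\, \|\mathbf{x} - \mathbf{x}'\|_\infty$,
(ii) $\Psi(\mathbf{x}, t) - \Psi(\mathbf{x}', t) \le (1 + 6\sqrt{t/n})\, \|\mathbf{x} - \mathbf{x}'\|_\infty$.

Proof: One verifies that

$$\sqrt{V_n(\mathbf{x})} - \sqrt{V_n(\mathbf{x}')} \le \sqrt{2}\, \|\mathbf{x} - \mathbf{x}'\|_\infty,$$

which implies (i) and (ii). ■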

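Given two vectors $\mathbf{x}, \mathbf{x}' \in \mathcal{X}^n$ and $\sigma \in \{-1, 1\}^n$ define
$(\sigma, \mathbf{x}, \mathbf{x}') \in \mathcal{X}^n$ by $(\sigma, \mathbf{x}, \mathbf{x}')_i = x_i$ if $\sigma_i = 1$ and
$(\sigma, \mathbf{x}, \mathbf{x}')_i = x'_i$ if $\sigma_i = -1$. In the following the $\sigma_i$ will
be independent random variables, uniformly distributed on
$\{-1, 1\}$.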
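The swapping operation is simple enough to state as code. The sketch below is illustrative (the function name `swap` is not from the paper) and shows that $(\sigma, \mathbf{x}, \mathbf{x}')$ and $(-\sigma, \mathbf{x}, \mathbf{x}')$ together always use each of the $2n$ points exactly once.

```python
def swap(sigma, x, x_prime):
    """Return (sigma, x, x'): coordinate i is x[i] if sigma[i] == 1, else x_prime[i]."""
    return [xi if s == 1 else xpi for s, xi, xpi in zip(sigma, x, x_prime)]


def double_sample(sigma, x, x_prime):
    """Return the pair ((sigma, x, x'), (-sigma, x, x')) used in the symmetrization."""
    negated = [-s for s in sigma]
    return swap(sigma, x, x_prime), swap(negated, x, x_prime)
```

For instance, `double_sample([1, -1, 1], [1, 2, 3], [10, 20, 30])` returns `([1, 20, 3], [10, 2, 30])`.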
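Lemma 13 Let $\mathbf{X} = (X_1, \dots, X_n)$ and $\mathbf{X}' = (X'_1, \dots, X'_n)$
be random vectors with values in $\mathcal{X}$ such that all the $X_i$ and
$X'_i$ are independent and identically distributed. Suppose that
$F : \mathcal{X}^{2n} \to [0, 1]$. Then

$$\mathbb{E}F(\mathbf{X}, \mathbf{X}') \le \sup_{(\mathbf{x}, \mathbf{x}') \in \mathcal{X}^{2n}} \mathbb{E}_\sigma F((\sigma, \mathbf{x}, \mathbf{x}'), (-\sigma, \mathbf{x}, \mathbf{x}')).$$

Proof: For any configuration $\sigma$ and $(\mathbf{X}, \mathbf{X}')$, the configuration
$((\sigma, \mathbf{X}, \mathbf{X}'), (-\sigma, \mathbf{X}, \mathbf{X}'))$ is obtained from $(\mathbf{X}, \mathbf{X}')$ by
exchanging $X_i$ and $X'_i$ whenever $\sigma_i = -1$. Since $X_i$ and
$X'_i$ are identically distributed this does not affect the expectation.
Thus

$$\begin{aligned}
\mathbb{E}F(\mathbf{X}, \mathbf{X}') &= \mathbb{E}_\sigma \mathbb{E}F((\sigma, \mathbf{X}, \mathbf{X}'), (-\sigma, \mathbf{X}, \mathbf{X}')) \\
&\le \sup_{(\mathbf{x}, \mathbf{x}') \in \mathcal{X}^{2n}} \mathbb{E}_\sigma F((\sigma, \mathbf{x}, \mathbf{x}'), (-\sigma, \mathbf{x}, \mathbf{x}')). \quad ■
\end{aligned}$$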



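The next lemma is where we use the concentration results
in Section 2.

Lemma 14 Let $f : \mathcal{X} \to [0, 1]$ and $(\mathbf{x}, \mathbf{x}') \in \mathcal{X}^{2n}$ be fixed.
Then

$$\Pr_\sigma \{ \Phi(f(\sigma, \mathbf{x}, \mathbf{x}'), t) > \Psi(f(-\sigma, \mathbf{x}, \mathbf{x}'), t) \} \le 5e^{-t}.$$

Proof: Define the random vector $\mathbf{Y} = (Y_1, \dots, Y_n)$, where
the $Y_i$ are independent random variables, each $Y_i$ being uniformly
distributed on $\{f(x_i), f(x'_i)\}$. The $Y_i$ are of course
not identically distributed. Within this proof we use the shorthand
notation $\mathbb{E}P_n = \mathbb{E}_{\mathbf{Y}} P_n(\mathbf{Y})$ and $\mathbb{E}V_n = \mathbb{E}_{\mathbf{Y}} V_n(\mathbf{Y})$,
and let

$$A = \mathbb{E}P_n + \sqrt{\frac{8\, \mathbb{E}V_n\, t}{n}} + \frac{14t}{3(n-1)}.$$

Evidently

$$\begin{aligned}
\Pr_\sigma \{ \Phi(f(\sigma, \mathbf{x}, \mathbf{x}'), t) > \Psi(f(-\sigma, \mathbf{x}, \mathbf{x}'), t) \}
&\le \Pr_\sigma \{ \Phi(f(\sigma, \mathbf{x}, \mathbf{x}'), t) > A \} + \Pr_\sigma \{ A > \Psi(f(-\sigma, \mathbf{x}, \mathbf{x}'), t) \} \\
&= \Pr_{\mathbf{Y}} \{ \Phi(\mathbf{Y}, t) > A \} + \Pr_{\mathbf{Y}} \{ A > \Psi(\mathbf{Y}, t) \}.
\end{aligned}$$

To prove our result we will bound these two probabilities in
turn. Now

$$\begin{aligned}
\Pr\{ \Phi(\mathbf{Y}, t) > A \}
\le{} & \Pr\left\{ P_n(\mathbf{Y}) > \mathbb{E}P_n + \sqrt{\frac{2\, \mathbb{E}V_n\, t}{n}} + \frac{t}{3(n-1)} \right\} \\
& + \Pr\left\{ \sqrt{\frac{2 V_n(\mathbf{Y})\, t}{n}} > \sqrt{\frac{2\, \mathbb{E}V_n\, t}{n}} + \frac{2t}{n-1} \right\}.
\end{aligned}$$

Since $\sum_i \mathbb{V}(Y_i) \le n\, \mathbb{E}V_n$ by equation (9), the first of
these probabilities is at most $e^{-t}$ by Bennett's inequality,
which also holds for variables which are not identically distributed.
That the second of these probabilities is bounded
by $e^{-t}$ follows directly from Theorem 10 (4). We conclude
that $\Pr_{\mathbf{Y}} \{ \Phi(\mathbf{Y}, t) > A \} \le 2e^{-t}$.
Since $\sqrt{2} + \sqrt{8} = \sqrt{18}$ we have

$$\begin{aligned}
\Pr\{ A > \Psi(\mathbf{Y}, t) \}
\le{} & \Pr\left\{ \mathbb{E}P_n > P_n(\mathbf{Y}) + \sqrt{\frac{2 V_n(\mathbf{Y})\, t}{n}} + \frac{7t}{3(n-1)} \right\} \\
& + \Pr\left\{ \sqrt{\frac{8\, \mathbb{E}V_n\, t}{n}} > \sqrt{\frac{8 V_n(\mathbf{Y})\, t}{n}} + \frac{4t}{n-1} \right\}.
\end{aligned}$$

The first probability in the sum is at most $2e^{-t}$ by Theorem
11 and the second is at most $e^{-t}$ by Theorem 10 (3). Hence
$\Pr_{\mathbf{Y}} \{ A > \Psi(\mathbf{Y}, t) \} \le 3e^{-t}$, so it follows that

$$\Pr_\sigma \{ \Phi(f(\sigma, \mathbf{x}, \mathbf{x}'), t) > \Psi(f(-\sigma, \mathbf{x}, \mathbf{x}'), t) \} \le 5e^{-t}. \quad ■$$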



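Proof of Theorem 6: It follows from Theorem 11 that for
$t \ge \ln 4$ we have for any $f \in \mathcal{F}$ that

$$\Pr\{ \Phi(f(\mathbf{X}), t) > P(f, \mu) \} \ge 1/2.$$

In other words, the functional

$$f \mapsto \Lambda(f) = \mathbb{E}_{\mathbf{X}} \mathbf{1}\{ \Phi(f(\mathbf{X}), t) > P(f, \mu) \}$$

satisfies $1 \le 2\Lambda(f)$ for all $f$. Consequently, for any $s > 0$
we have, using $\mathbf{1}A$ to denote the indicator function of $A$ and
$\mathbf{X}'$ to denote an independent copy of $\mathbf{X}$, that

$$\begin{aligned}
\Pr\{ \exists f \in \mathcal{F} &: P(f, \mu) > \Psi(f(\mathbf{X}), t) + s \} \\
&= \mathbb{E}_{\mathbf{X}} \sup_{f \in \mathcal{F}} \mathbf{1}\{ P(f, \mu) > \Psi(f(\mathbf{X}), t) + s \} \\
&\le \mathbb{E}_{\mathbf{X}} \sup_{f \in \mathcal{F}} \mathbf{1}\{ P(f, \mu) > \Psi(f(\mathbf{X}), t) + s \}\, 2\Lambda(f) \\
&= 2\, \mathbb{E}_{\mathbf{X}} \sup_{f \in \mathcal{F}} \mathbb{E}_{\mathbf{X}'} \mathbf{1}\{ P(f, \mu) > \Psi(f(\mathbf{X}), t) + s \text{ and } \Phi(f(\mathbf{X}'), t) > P(f, \mu) \} \\
&\le 2\, \mathbb{E}_{\mathbf{X}\mathbf{X}'} \sup_{f \in \mathcal{F}} \mathbf{1}\{ P(f, \mu) > \Psi(f(\mathbf{X}), t) + s \text{ and } \Phi(f(\mathbf{X}'), t) > P(f, \mu) \} \\
&\le 2\, \mathbb{E}_{\mathbf{X}\mathbf{X}'} \sup_{f \in \mathcal{F}} \mathbf{1}\{ \Phi(f(\mathbf{X}'), t) > \Psi(f(\mathbf{X}), t) + s \} \\
&\le 2 \sup_{(\mathbf{x}, \mathbf{x}') \in \mathcal{X}^{2n}} \Pr_\sigma \{ \exists f \in \mathcal{F} : \Phi(f(\sigma, \mathbf{x}, \mathbf{x}'), t) > \Psi(f(-\sigma, \mathbf{x}, \mathbf{x}'), t) + s \},
\end{aligned}$$

where we used Lemma 13 in the last step.

Now we fix $(\mathbf{x}, \mathbf{x}') \in \mathcal{X}^{2n}$ and let $\epsilon > 0$ be arbitrary.
We can choose a finite subset $\mathcal{F}_0$ of $\mathcal{F}$ such that $|\mathcal{F}_0| \le
\mathcal{N}_\infty(\epsilon, \mathcal{F}, 2n)$ and that $\forall f \in \mathcal{F}$ there exists $\hat{f} \in \mathcal{F}_0$ such
that $|f(x_i) - \hat{f}(x_i)| < \epsilon$ and $|f(x'_i) - \hat{f}(x'_i)| < \epsilon$, for all
$i \in \{1, \dots, n\}$. Suppose there exists $f \in \mathcal{F}$ such that

$$\Phi(f(\sigma, \mathbf{x}, \mathbf{x}'), t) > \Psi(f(-\sigma, \mathbf{x}, \mathbf{x}'), t) + \left( 2 + 8\sqrt{t/n} \right) \epsilon.$$

It follows from Lemma 12 (i) and (ii) that there must exist
$\hat{f} \in \mathcal{F}_0$ such that

$$\Phi(\hat{f}(\sigma, \mathbf{x}, \mathbf{x}'), t) > \Psi(\hat{f}(-\sigma, \mathbf{x}, \mathbf{x}'), t).$$

We conclude from the above that

$$\begin{aligned}
\Pr_\sigma \Big\{ \exists f \in \mathcal{F} &: \Phi(f(\sigma, \mathbf{x}, \mathbf{x}'), t) > \Psi(f(-\sigma, \mathbf{x}, \mathbf{x}'), t) + \left( 2 + 8\sqrt{t/n} \right) \epsilon \Big\} \\
&\le \Pr_\sigma \{ \exists f \in \mathcal{F}_0 : \Phi(f(\sigma, \mathbf{x}, \mathbf{x}'), t) > \Psi(f(-\sigma, \mathbf{x}, \mathbf{x}'), t) \} \\
&\le \sum_{f \in \mathcal{F}_0} \Pr_\sigma \{ \Phi(f(\sigma, \mathbf{x}, \mathbf{x}'), t) > \Psi(f(-\sigma, \mathbf{x}, \mathbf{x}'), t) \} \\
&\le 5\, \mathcal{N}_\infty(\epsilon, \mathcal{F}, 2n)\, e^{-t},
\end{aligned}$$

where we used Lemma 14 in the last step. We arrive at the
statement that

$$\Pr\left\{ \exists f \in \mathcal{F} : P(f, \mu) > \Psi(f(\mathbf{X}), t) + \left( 2 + 8\sqrt{t/n} \right) \epsilon \right\} \le 10\, \mathcal{N}_\infty(\epsilon, \mathcal{F}, 2n)\, e^{-t}.$$

Equating this probability to $\delta$, solving for $t$, substituting $\epsilon =
1/n$ and using $8\sqrt{t/n} \le 2t$, for $n \ge 16$ and $t \ge 1$, give the
result. ■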
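The final substitution can be written out explicitly. The following is one way to do the bookkeeping, using only the quantities defined in the proof; it shows where the constant 15 in Theorem 6 comes from.

```latex
% Setting 10 N_infty(1/n, F, 2n) e^{-t} = \delta gives t = \ln(M(n)/\delta).
% Since M(n) >= 10 and \delta < 1, we have t >= \ln 10 > 1 (and t >= \ln 4).
\begin{align*}
\left(2 + 8\sqrt{t/n}\right)\frac{1}{n}
  &\le \frac{2 + 2t}{n} \le \frac{4t}{n} \le \frac{4t}{n-1}
  && (n \ge 16,\ t \ge 1), \\
\Psi(f(\mathbf{X}), t) + \left(2 + 8\sqrt{t/n}\right)\frac{1}{n}
  &\le P_n(f, \mathbf{X}) + \sqrt{\frac{18 V_n(f, \mathbf{X})\, t}{n}}
     + \frac{11t}{n-1} + \frac{4t}{n-1} \\
  &= P_n(f, \mathbf{X}) + \sqrt{\frac{18 V_n(f, \mathbf{X})\, t}{n}} + \frac{15t}{n-1}.
\end{align*}
```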

We remark that a simplified version of the above argument
gives uniform bounds for the standard deviation $\sqrt{V(f, \mu)}$,
using Theorem 10 (3) and (4).

4 Sample variance penalization versus 
empirical risk minimization 

Since empirical Bernstein bounds are observable, have estimation
errors which can be as small as $O(1/n)$ for small
sample variances, and can be adjusted to hold uniformly over
realistic function classes, they suggest a method which minimizes
the bounds of Corollary 5 or Theorem 6. Specifically
we consider the algorithm

$$\mathrm{SVP}_\lambda(\mathbf{X}) = \arg\min_{f \in \mathcal{F}} P_n(f, \mathbf{X}) + \lambda \sqrt{\frac{V_n(f, \mathbf{X})}{n}}, \qquad (10)$$

where $\lambda$ is a non-negative parameter. We call this method
sample variance penalization (SVP). Choosing the regularization
parameter $\lambda = 0$ reduces the algorithm to empirical
risk minimization (ERM).

It is intuitively clear that SVP will be inferior to ERM if
losses corresponding to better hypotheses have larger variances
than the worse ones. But this seems to be a somewhat
unnatural situation. If, on the other hand, there are some optimal
hypotheses of small variance, then SVP should work
well. To make this rigorous we provide a result, which can
be used to bound the excess risk of $\mathrm{SVP}_\lambda$. Below we use
Theorem 6, but it is clear how the argument is to be modified
to obtain better constants for finite hypothesis spaces.

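Theorem 15 Let $X$ be a random variable with values in a
set $\mathcal{X}$ with distribution $\mu$, and let $\mathcal{F}$ be a class of hypotheses
$f : \mathcal{X} \to [0, 1]$. Fix $\delta \in (0, 1)$, $n \ge 2$ and set $\mathcal{M}(n) =
10\, \mathcal{N}_\infty(1/n, \mathcal{F}, 2n)$ and $\lambda = \sqrt{18 \ln(3\mathcal{M}(n)/\delta)}$.

Fix $f^* \in \mathcal{F}$. Then with probability at least $1 - \delta$ in the
draw of $\mathbf{X} \sim \mu^n$,

$$P(\mathrm{SVP}_\lambda(\mathbf{X}), \mu) - P(f^*, \mu) \le \sqrt{\frac{32 V(f^*, \mu) \ln(3\mathcal{M}(n)/\delta)}{n}} + \frac{22 \ln(3\mathcal{M}(n)/\delta)}{n-1}.$$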

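Proof: Denote the hypothesis $\mathrm{SVP}_\lambda(\mathbf{X})$ by $\hat{f}$. By Theorem
6 we have with probability at least $1 - \delta/3$ that

$$\begin{aligned}
P(\hat{f}, \mu) &\le P_n(\hat{f}, \mathbf{X}) + \lambda \sqrt{\frac{V_n(\hat{f}, \mathbf{X})}{n}} + \frac{15\lambda^2}{18(n-1)} \\
&\le P_n(f^*, \mathbf{X}) + \lambda \sqrt{\frac{V_n(f^*, \mathbf{X})}{n}} + \frac{15\lambda^2}{18(n-1)}.
\end{aligned}$$

The second inequality follows from the definition of $\mathrm{SVP}_\lambda$.
By Bennett's inequality (Theorem 3) we have with probability
at least $1 - \delta/3$ that

$$P_n(f^*, \mathbf{X}) \le P(f^*, \mu) + \sqrt{\frac{2 V(f^*, \mu) \ln 3/\delta}{n}} + \frac{\ln 3/\delta}{3n}$$

and by Theorem 10 (4) we have with probability at least
$1 - \delta/3$ that

$$\sqrt{V_n(f^*, \mathbf{X})} \le \sqrt{V(f^*, \mu)} + \sqrt{\frac{2 \ln 3/\delta}{n-1}}.$$

Combining these three inequalities in a union bound and using
$\ln(3\mathcal{M}(n)/\delta) \ge 1$ and some other crude but obvious
estimates, we obtain with probability at least $1 - \delta$

$$P(\hat{f}, \mu) - P(f^*, \mu) \le \sqrt{\frac{32 V(f^*, \mu) \ln(3\mathcal{M}(n)/\delta)}{n}} + \frac{22 \ln(3\mathcal{M}(n)/\delta)}{n-1}. \quad ■$$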
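The "crude but obvious estimates" can be made explicit. The following is one possible accounting, writing $L = \ln(3\mathcal{M}(n)/\delta)$ and $V = V(f^*, \mu)$ (abbreviations introduced here), so that $\lambda^2 = 18L$ and $\ln(3/\delta) \le L$.

```latex
% L = \ln(3 M(n) / \delta), V = V(f^*, \mu), \lambda = \sqrt{18 L}.
\begin{align*}
\lambda \sqrt{\frac{V_n(f^*, \mathbf{X})}{n}}
  &\le \sqrt{\frac{18L}{n}} \left( \sqrt{V} + \sqrt{\frac{2L}{n-1}} \right)
   \le \sqrt{\frac{18 V L}{n}} + \frac{6L}{n-1}, \\
\sqrt{\frac{2 V L}{n}} + \sqrt{\frac{18 V L}{n}}
  &= \left( \sqrt{2} + \sqrt{18} \right) \sqrt{\frac{V L}{n}}
   = \sqrt{\frac{32 V L}{n}}, \\
\frac{L}{3n} + \frac{6L}{n-1} + \frac{15 \lambda^2}{18(n-1)}
  &\le \left( \tfrac{1}{3} + 6 + 15 \right) \frac{L}{n-1}
   \le \frac{22 L}{n-1}.
\end{align*}
```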



If we let $f^*$ be an optimal hypothesis we obtain a bound
on the excess risk. The square-root term in the bound scales
with the standard deviation of this hypothesis, which can be
quite small. In particular, if there is an optimal (minimal
risk) hypothesis of zero variance, then the excess risk of the
hypothesis chosen by SVP decays as $(\ln \mathcal{M}(n))/n$. In the
case of finite hypothesis spaces $\mathcal{M}(n) = |\mathcal{F}|$ is independent
of $n$ and the excess risk then decays as $1/n$. Observe that
apart from the complexity bound on $\mathcal{F}$ no assumption such
as convexity of the function class or special properties of the
loss functions were needed to derive this result.

To demonstrate a potential competitive edge of SVP over
ERM we will now give a very simple example of this type,
where the excess risk of the hypothesis chosen by ERM is of
order $O(1/\sqrt{n})$.

Suppose that $\mathcal{F}$ consists of only two hypotheses $\mathcal{F} =
\{c_{1/2}, b_{1/2+\epsilon}\}$. The underlying distribution $\mu$ is such that
$c_{1/2}(X) = 1/2$ almost surely and $b_{1/2+\epsilon}(X)$ is a Bernoulli
variable with expectation $1/2 + \epsilon$, where $\epsilon \le 1/\sqrt{8}$. The
hypothesis $c_{1/2}$ is optimal and has zero variance, the hypothesis
$b_{1/2+\epsilon}$ has excess risk $\epsilon$ and variance $1/4 - \epsilon^2$. We are given
an i.i.d. sample $\mathbf{X} = (X_1, \dots, X_n) \sim \mu^n$ on which we are
to base the selection of either hypothesis.

It follows from the previous theorem (with $f^* = c_{1/2}$),
that the excess risk of $\mathrm{SVP}_\lambda$ decays as $1/n$, for suitably chosen
$\lambda$. To make our point we need to give a lower bound for
the excess risk of empirical risk minimization. We use the
following inequality due to Slud which we cite in the form
given in [1, p. 363].

Theorem 16 Let $B$ be a binomial $(n, p)$ random variable
with $p \le 1/2$ and suppose that $np \le t \le n(1 - p)$. Then

$$\Pr\{B > t\} \ge \Pr\left\{ Z > \frac{t - np}{\sqrt{np(1-p)}} \right\},$$

where $Z$ is a standard normal $N(0, 1)$-distributed random
variable.

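Now ERM selects the inferior hypothesis $b_{1/2+\epsilon}$ if

$$P_n(b_{1/2+\epsilon}, \mathbf{X}) < P_n(c_{1/2}, \mathbf{X}) = 1/2.$$

We therefore obtain from Theorem 16 with

$$B = n\, (1 - P_n(b_{1/2+\epsilon}(\mathbf{X}))),$$

$p = 1/2 - \epsilon$ and $t = n/2$ that

$$\begin{aligned}
\Pr\{ \mathrm{ERM}(\mathbf{X}) = b_{1/2+\epsilon} \} &= \Pr\{ P_n(b_{1/2+\epsilon}(\mathbf{X})) < 1/2 \} \\
&\ge \Pr\{ B > t \} \\
&\ge \Pr\left\{ Z > \frac{\sqrt{n}\, \epsilon}{\sqrt{1/4 - \epsilon^2}} \right\}.
\end{aligned}$$

A well known bound for standard normal random variables
gives for $\eta > 0$

$$\Pr\{ Z > \eta \} \ge \frac{1}{\sqrt{2\pi}} \frac{\eta}{1 + \eta^2} \exp\left( -\eta^2/2 \right) \ge \exp\left( -\eta^2 \right), \quad \text{if } \eta \ge 2.$$

If we assume $n \ge \epsilon^{-2}$ we have $\sqrt{n}\, \epsilon / \sqrt{1/4 - \epsilon^2} \ge 2$, so

$$\Pr\{ \mathrm{ERM}(\mathbf{X}) = b_{1/2+\epsilon} \} \ge \exp\left( -\frac{n \epsilon^2}{1/4 - \epsilon^2} \right) \ge e^{-8n\epsilon^2},$$

where we used $\epsilon \le 1/\sqrt{8}$ in the last inequality. Since this is
just the probability that the excess risk is $\epsilon$ we arrive at the
following statement: For every $n \ge \epsilon^{-2}$ there exists $\delta$ ($=
e^{-8n\epsilon^2}$) such that the excess risk of the hypothesis generated
by ERM is at least

$$\epsilon = \sqrt{\frac{\ln 1/\delta}{8n}},$$

with probability at least $\delta$. Therefore the excess risk for ERM
cannot have a faster rate than $O(1/\sqrt{n})$.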
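In this two-hypothesis example the probability that ERM picks the inferior hypothesis can be computed exactly from the binomial distribution, which allows a direct comparison with the lower bound $e^{-8n\epsilon^2}$. The sketch below is illustrative; the function names are assumptions, and ties ($P_n = 1/2$) are counted as not selecting the Bernoulli hypothesis, in line with the strict inequality used above.

```python
import math


def prob_erm_picks_bernoulli(n, eps):
    """Exact Pr{P_n(b) < 1/2} when b(X) is Bernoulli with mean 1/2 + eps."""
    q = 0.5 + eps
    # The number of ones S is Binomial(n, q); the empirical mean is < 1/2 iff 2 S < n.
    return sum(math.comb(n, s) * q**s * (1.0 - q) ** (n - s)
               for s in range(n + 1) if 2 * s < n)


def exponential_lower_bound(n, eps):
    """The bound exp(-8 n eps^2), derived for n >= eps^-2 and eps <= 1/sqrt(8)."""
    return math.exp(-8.0 * n * eps**2)
```

With `eps = 0.1` and `n = 100` the exact probability is on the order of $10^{-2}$, comfortably above the bound $e^{-8} \approx 3.4 \times 10^{-4}$, so the bound is valid but far from tight in this regime.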

This example is of course a very artificial construction, 
chosen as a simple illustration. It is clear that the conclu- 
sions do not change if we add any number of deterministic 
hypotheses with risk larger than 1 /2 (they simply have no ef- 
fect), or if we add any number of Bernoulli hypotheses with 
risk at least 1/2 + e (they just make things worse for ERM). 

To obtain a more practical insight into the potential advantages
of SVP we have conducted a simple experiment,
where $\mathcal{X} = [0, 1]^K$ and the random variable $X \in \mathcal{X}$ is distributed
according to $\prod_{k=1}^{K} \mu_{a_k, b_k}$ where

$$\mu_{a,b} = (1/2) (\delta_{a-b} + \delta_{a+b}).$$

Each coordinate $\pi_k(X)$ of $X$ is thus a binary random variable,
assuming the values $a_k - b_k$ and $a_k + b_k$ with equal
probability, having expectation $a_k$ and variance $b_k^2$.

The distribution of $X$ is itself generated at random by selecting
the pairs $(a_k, b_k)$ independently: $a_k$ is chosen from
the uniform distribution on $[B, 1 - B]$ and the standard deviation
$b_k$ is chosen from the uniform distribution on the interval
$[0, B]$. Thus $B$ is the only parameter governing the
generation of the distribution.

As hypotheses we just take the $K$ coordinate functions
$\pi_k$ in $[0, 1]^K$. Selecting the $k$-th hypothesis then just means
that we select the corresponding distribution $\mu_{a_k, b_k}$. Of course
we want to find a hypothesis of small risk $a_k$, but we can only
observe $a_k$ through the corresponding sample, the observation
being obscured by the variance $b_k^2$.

We chose $B = 1/4$ and $K = 500$. We tested the algorithm
(10) with $\lambda = 0$, corresponding to ERM, and $\lambda = 2.5$.
The sample sizes ranged from 10 to 500. We recorded the
true risks of the respective hypotheses generated, and averaged
these risks over 10000 randomly generated distributions.
The results are reported in Figure 1 and show clearly
the advantage of SVP in this particular case. It must however
be pointed out that this advantage, while being consistent, is
small compared to the risk of the optimal hypotheses (around
1/4).

Figure 1: Comparison of the excess risks of the hypotheses
returned by ERM (circled line) and SVP with $\lambda = 2.5$
(squared line) for different sample sizes.
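The experiment described above is easy to re-create in outline. The following sketch performs one repetition: it draws the pairs $(a_k, b_k)$, samples each coordinate, and returns the excess risk of the hypothesis selected for each value of $\lambda$. It is an illustrative reconstruction, not the authors' code; the function names, the use of Python's `random` module, and evaluating all values of $\lambda$ on the same sample are assumptions.

```python
import math
import random


def draw_distribution(rng, K, B):
    """Draw means a_k uniformly from [B, 1 - B] and deviations b_k uniformly from [0, B]."""
    a = [rng.uniform(B, 1.0 - B) for _ in range(K)]
    b = [rng.uniform(0.0, B) for _ in range(K)]
    return a, b


def excess_risks(rng, K, B, n, lambdas):
    """One repetition: excess risk of the selected coordinate for each lambda."""
    a, b = draw_distribution(rng, K, B)
    means, variances = [], []
    for k in range(K):
        # Coordinate k takes the values a_k - b_k and a_k + b_k with equal probability.
        sample = [a[k] + b[k] * rng.choice((-1.0, 1.0)) for _ in range(n)]
        mean = sum(sample) / n
        means.append(mean)
        variances.append(sum((s - mean) ** 2 for s in sample) / (n - 1))
    best_risk = min(a)
    result = []
    for lam in lambdas:
        chosen = min(range(K),
                     key=lambda k: means[k] + lam * math.sqrt(variances[k] / n))
        result.append(a[chosen] - best_risk)
    return result
```

Averaging `excess_risks(rng, 500, 0.25, n, (0.0, 2.5))` over many repetitions and several sample sizes `n` would give curves comparable in spirit to Figure 1; no particular outcome is claimed here for this sketch.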

If we try to extract a practical conclusion from Theorem
15, our example and the experiment, then it appears that SVP
might be a good alternative to ERM, whenever the optimal
members of the hypothesis space still have substantial risk
(for otherwise ERM would do just as well), but there are
optimal hypotheses of very small variance. These two conditions
seem to be generic for many noisy situations: when
the noise arises from many independent sources, but does
not depend too much on any single source, then the loss of
an optimal hypothesis should be sharply concentrated around
its expectation (e.g. by the bounded difference inequality -
see H), resulting in a small variance.

5 Application to sample compression 

Sample compression schemes [6] provide an elegant method
to reduce a potentially very complex function class to a finite,
data-dependent subclass. With $\mathcal{F}$ being as usual, assume that
some algorithm $A$ is already specified by a fixed function

$$A : \mathbf{X} \in \bigcup_{n=1}^{\infty} \mathcal{X}^n \mapsto A_{\mathbf{X}} \in \mathcal{F}.$$

The function $A_S$ can be interpreted as the hypothesis chosen
by the algorithm on the basis of the training set $S$, composed
with the fixed loss function. For $x \in \mathcal{X}$ the quantity $A_S(x)$
is thus the loss incurred by training the algorithm from $S$ and
applying the resulting hypothesis to $x$.

The idea of sample compression schemes [6] is to train
the algorithm on subsamples of the training data and to use 



the remaining data points for testing. A comparison of the 
different results then leads to the choice of a subsample and 
a corresponding hypothesis. If this hypothesis has small risk, 
we can say that the problem-relevant information of the sam- 
ple is present in the subsample in a compressed form, hence 
the name. 

Since the method is crucially dependent on the quality of
the individual performance estimates, and empirical Bernstein
bounds give tight, variance sensitive estimates, a combination
of sample compression and SVP is promising. For
simplicity we only consider compression sets of a fixed size
$d$. We introduce the following notation for a subset $I \subset
\{1, \dots, n\}$ of cardinality $|I| = d$.

• $A_{\mathbf{X}[I]}$ = the hypothesis trained with $A$ from the subsample
$\mathbf{X}[I]$ consisting of those examples whose indices
lie in $I$.

• For $f \in \mathcal{F}$, we let

$$P_{I^c}(f) = P_{n-d}(f(\mathbf{X}[I^c])) = \frac{1}{n-d} \sum_{i \notin I} f(X_i),$$

the empirical risk of $f$ computed on the subsample $\mathbf{X}[I^c]$
consisting of those examples whose indices do not lie in
$I$.

• For $f \in \mathcal{F}$, we let

$$V_{I^c}(f) = V_{n-d}(f(\mathbf{X}[I^c])) = \frac{1}{2(n-d)(n-d-1)} \sum_{i,j \notin I} (f(X_i) - f(X_j))^2,$$

the sample variance of $f$ computed on $\mathbf{X}[I^c]$.

• $\mathcal{C}$ = the collection of subsets $I \subset \{1, \dots, n\}$ of cardinality
$|I| = d$.

With this notation we define our sample compression scheme

$$\mathrm{SVP}_\lambda(\mathbf{X}) = A_{\mathbf{X}[\hat{I}]}, \qquad \hat{I} = \arg\min_{I \in \mathcal{C}} P_{I^c}(A_{\mathbf{X}[I]}) + \lambda \sqrt{V_{I^c}(A_{\mathbf{X}[I]})}.$$

As usual, $\lambda = 0$ gives the classical sample compression
schemes. The performance of this algorithm can be guaranteed
by the following result.

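Theorem 17 With the notation introduced above fix $\delta \in (0, 1)$,
$n \ge 2$ and set $\lambda = \sqrt{2 \ln(6|\mathcal{C}|/\delta)}$. Then with probability at
least $1 - \delta$ in the draw of $\mathbf{X} \sim \mu^n$, we have for every $I^* \in \mathcal{C}$

$$P(\mathrm{SVP}_\lambda(\mathbf{X}), \mu) - P(A_{\mathbf{X}[I^*]}, \mu) \le \sqrt{\frac{8 V(A_{\mathbf{X}[I^*]}, \mu) \ln(6|\mathcal{C}|/\delta)}{n-d}} + \frac{14 \ln(6|\mathcal{C}|/\delta)}{3(n-d-1)}.$$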



Proof: Use a union bound and Theorem 4 to obtain an empirical
Bernstein bound uniformly valid over all $A_{\mathbf{X}[I]}$ with
$I \in \mathcal{C}$ and therefore also valid for $\mathrm{SVP}_\lambda(\mathbf{X})$. Then follow
the proof of Theorem 15. Since now $I^* \in \mathcal{C}$ is chosen after
seeing the sample, uniform versions of Bennett's inequality
and Theorem 10 (4) have to be used, and are again readily
obtained with union bounds over $\mathcal{C}$. ■
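The compression scheme of this section can be sketched by brute force for small $n$ and $d$. In the code below, `train` (mapping a subsample to a hypothesis) and `loss` (mapping a hypothesis and an example to a number in $[0, 1]$) are hypothetical callables supplied by the user; the penalty is written as $\lambda \sqrt{V_{I^c}}$, following the selection rule as displayed in the text. Enumerating all of $\mathcal{C}$ costs $\binom{n}{d}$ training runs, so this is an illustration rather than a practical implementation.

```python
import math
from itertools import combinations


def compression_svp(sample, d, train, loss, lam):
    """Return (I_hat, hypothesis) minimizing P_{I^c} + lam * sqrt(V_{I^c}) over |I| = d.

    Requires len(sample) - d >= 2 so that the held-out sample variance is defined.
    """
    n = len(sample)
    best = None
    for subset in combinations(range(n), d):
        hypothesis = train([sample[i] for i in subset])
        # Losses on the examples whose indices do not lie in the compression set.
        held_out = [loss(hypothesis, sample[i]) for i in range(n) if i not in subset]
        m = len(held_out)
        risk = sum(held_out) / m
        var = sum((u - v) ** 2 for u, v in combinations(held_out, 2)) / (m * (m - 1))
        score = risk + lam * math.sqrt(var)
        if best is None or score < best[0]:
            best = (score, subset, hypothesis)
    return best[1], best[2]
```

As a toy usage, take real-valued examples, let `train` return the single retained point and let `loss` be the absolute difference; the scheme then selects a retained point that predicts the held-out points well and with little spread.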

The interpretation of this result as an excess risk bound
is more subtle than for Theorem 15, because the optimal
hypothesis is now sample-dependent. If we define

$$I^* = \arg\min_{I \in \mathcal{C}} P(A_{\mathbf{X}[I]}, \mu),$$

then the theorem tells us how close we are to the choice
of the optimal subsample. This will be considerably better
than what we get from Hoeffding's inequality if the variance
$V(A_{\mathbf{X}[I^*]}, \mu)$ is small and sparse solutions are sought in the
sense that $d/n$ is small (observe that $\ln |\mathcal{C}| \le d \ln(ne/d)$).

This type of relative excess risk bound is of course more
useful if the minimum $P(A_{\mathbf{X}[I^*]}, \mu)$ is close to some true
optimum arising from some underlying generative model. In
this case we can expect the loss $A_{\mathbf{X}[I^*]}$ to behave like a noise
variable centered at the risk $P(A_{\mathbf{X}[I^*]}, \mu)$. If the noise arises
from many independent sources, each of which makes only a
small contribution, then $A_{\mathbf{X}[I^*]}$ will be sharply concentrated
and have a small variance $V(A_{\mathbf{X}[I^*]}, \mu)$, resulting in tight
control of the excess risk.
6 Conclusion

We presented sample variance penalization as a potential al- 
ternative to empirical risk minimization and analyzed some 
of its statistical properties in terms of empirical Bernstein 
bounds and concentration properties of the empirical stan- 
dard deviation. The promise of our method is that, in simple 
but perhaps practical scenarios the excess risk of our method 
is guaranteed to be substantially better than that of empirical 
risk minimization. 

The present work raises some questions. Perhaps the 
most pressing issue is to find an efficient implementation of 
the method, to deal with the fact that sample variance penal- 
ization is non-convex in many situations when empirical risk 
minimization is convex, and to compare the two methods on 
some real-life data sets. Another important issue is to further 
investigate the application of empirical Bernstein bounds to 
sample compression schemes. 